4 Why Self-Attention
Motivating our use of self-attention, we consider three desiderata.
One is the total computational complexity per layer.
See Table 1 (r is discussed separately below).
Here r is the size of the neighborhood in restricted self-attention.
Another is the amount of computation that can be parallelized.
The third is the path length between long-range dependencies in the network.
One key factor affecting the ability to learn such dependencies is the length of the paths forward and backward signals have to traverse in the network.
The shorter these paths between any combination of positions in the input and output sequences, the easier it is (reportedly) to learn long-range dependencies (Reference 12).
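To make the three criteria concrete, here is a minimal sketch that tabulates them for each layer type. These notes do not reproduce Table 1, so the orders below are the ones I recall from Table 1 of the original paper, with constant factors dropped; treat the numbers as growth rates, not measured costs. The names `layer_costs` and `log_ceil` are hypothetical helpers made up for this illustration.

```python
def log_ceil(n, k):
    """Smallest number of steps s such that k**s >= n (integer arithmetic only)."""
    steps, reach = 0, 1
    while reach < n:
        reach *= k
        steps += 1
    return steps


def layer_costs(n, d, k=3, r=None):
    """Map layer type -> (complexity per layer, sequential ops, max path length).

    n: sequence length, d: representation dimension,
    k: convolution kernel width, r: neighborhood size for restricted self-attention.
    Values are asymptotic orders with constants dropped (an assumption of this sketch).
    """
    costs = {
        "self-attention": (n * n * d, 1, 1),
        "recurrent": (n * d * d, n, n),
        "convolutional": (k * n * d * d, 1, log_ceil(n, k)),
    }
    if r is not None:
        # Path length n/r, rounded up to a whole number of layers.
        costs["restricted self-attention"] = (r * n * d, 1, -(-n // r))
    return costs


if __name__ == "__main__":
    for name, row in layer_costs(n=50, d=512, k=3, r=10).items():
        print(name, row)
```

With n smaller than d, as in this example call, the self-attention entry for per-layer complexity comes out smaller than the recurrent one, while its sequential-operation count and maximum path length stay constant.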
Regarding the first point, on the r in Table 1:
To improve computational performance for tasks involving very long sequences, self-attention could be restricted to considering only a neighborhood of size r in the input sequence centered around the respective output position.
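The restriction described above can be sketched in plain Python as follows. This is only an illustration under simplifying assumptions: a single head, no learned projections (queries, keys, and values are all the input vectors themselves), and a window of `r // 2` positions on each side, so an even `r` effectively covers `r + 1` positions. The function name `restricted_self_attention` is hypothetical, not something from the paper.

```python
import math


def restricted_self_attention(x, r):
    """Self-attention where position i attends only to a window centered on i.

    x: list of n vectors (lists of floats), each of length d.
    r: neighborhood size; the window spans r // 2 positions on each side of i.
    Assumption: queries, keys, and values are x itself (no learned projections).
    """
    n, d = len(x), len(x[0])
    half = r // 2
    out = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        # Scaled dot-product scores against the neighbors only: O(r * d) work.
        scores = [
            sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d)
            for j in range(lo, hi)
        ]
        # Numerically stable softmax over the window.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Weighted sum of the neighbors' vectors.
        out.append([
            sum(w / total * x[j][c] for w, j in zip(weights, range(lo, hi)))
            for c in range(d)
        ])
    return out


if __name__ == "__main__":
    seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
    print(restricted_self_attention(seq, r=3))
```

Each position does O(r·d) work instead of O(n·d), so the whole layer costs on the order of r·n·d. The price is that information between distant positions now has to pass through several layers, which is why the maximum path length grows to about n/r.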
As a side benefit, self-attention could yield more interpretable models.